fix(impala/pyspark): use regexp_replace to strip#11989
fix(impala/pyspark): use regexp_replace to strip#11989
regexp_replace to strip#11989Conversation
trim doesn't remove _f_strim does not remove f's
07b2fb4 to
b78c203
Compare
trim does not remove f's0ae77dc to
b5ad8f6
Compare
regexp_replace to strip
There was a problem hiding this comment.
Pull request overview
Updates Impala and PySpark string lstrip/rstrip/strip compilation to use regexp_replace for correct whitespace handling (notably form feed), and adjusts tests/snapshots accordingly.
Changes:
- Implement
LStrip/RStrip/Stripin the PySpark and Impala SQL compilers viaregexp_replacepatterns using\\s. - Extend backend string method tests with a form-feed case and remove PySpark/Databricks xfails for
lstrip/rstrip. - Refresh Impala SQL snapshots to match the new
REGEXP_REPLACEoutput; includes a few small Polars backend cleanups.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| ibis/backends/tests/test_string.py | Adds a \f test value and updates expectations; removes PySpark/Databricks lstrip/rstrip xfails. |
| ibis/backends/sql/compilers/pyspark.py | Implements visit_LStrip/visit_RStrip/visit_Strip using regexp_replace. |
| ibis/backends/sql/compilers/impala.py | Switches strip variants to regexp_replace to avoid incorrect trimming behavior. |
| ibis/backends/polars/init.py | Minor cleanup (typo fix, exception type, type annotation tweaks). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/lstrip/out.sql | Updates snapshot to REGEXP_REPLACE(..., '^\\s+', ''). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/rstrip/out.sql | Updates snapshot to REGEXP_REPLACE(..., '\\s+$', ''). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/strip/out.sql | Updates snapshot to `REGEXP_REPLACE(..., '^\s+ |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ibis/backends/polars/__init__.py
Outdated
| /, | ||
| *, | ||
| params: Mapping[ir.Expr, object] | None = None, | ||
| params: Mapping[ir.Scalar, Any] | None = None, |
There was a problem hiding this comment.
The execute() params type annotation was narrowed to Mapping[ir.Scalar, Any], but this method passes params through to _to_dataframe()/compile() which are annotated to accept Mapping[ir.Expr, ...]. Because Mapping is invariant in its key type, this can introduce static type-checking errors. Consider aligning these annotations (e.g., keep Mapping[ir.Expr, Any] everywhere, or update the internal helpers to also accept Mapping[ir.Scalar, Any]).
| params: Mapping[ir.Scalar, Any] | None = None, | |
| params: Mapping[ir.Expr, object] | None = None, |
|
I agree that the expected behavior is that To be clear, my understanding is (please correct if I'm wrong): On main, the current behavior is:
After this change:
|
Description of changes
Use
regexp_replaceinstead oftrimmethods to implementstripmethods, due to issues handling the form feed character.This is almost certainly slower. I alternatively considered dropping
\ffrom the set of characters to trim; however, data can include form feeds (e.g. from legacy systems), so correctness seems preferable.As an added bonus, this fixes two xfails on the PySpark side.
Issues closed